ABSTRACT

The rate of occurrence of words is not uniform but varies from 
document to document. Despite this observation, parameters for 
conventional n-gram language models are usually derived using 
the assumption of a constant word rate. In this paper we investigate
the use of a variable word rate assumption, modelled by a Poisson
distribution or a continuous mixture of Poissons. We present
an approach to estimating the relative frequencies of words or
n-grams that takes prior information about their occurrences into
account. Discounting and smoothing schemes are also considered. On
the Broadcast News task, the approach reduces perplexity by up to
10%.

1. INTRODUCTION 

In both spoken and written language, word occurrences are not 
random but vary greatly from document to document. Indeed, the 
field of information retrieval (IR) relies on the degree of depar- 
ture from randomness as a discriminative indicator. IR systems 
are typically based on unigram statistics (often referred to as a 
"bag-of-words" model), coupled with sophisticated term weight- 
ing schemes and similarity measures [jjj]. In an attempt to mathe- 
matically realise the intuition that an occurrence of a certain word 
may increase the chance that the same word is observed later, sev- 
eral probabilistic models of word occurrence have been proposed. 
Much of this work has revolved around the use of (a mixture of)
the Poisson distribution ^, ^]. Recently, Church and Gale have
demonstrated that a continuous mixture of Poisson distributions
can produce accurate estimates of variable word rate [Qj. Lowe
has introduced a beta-binomial mixture model which was applied
to topic tracking and detection [g].

Although a constant word rate is an unlikely premise, it is 
nevertheless adopted in many areas including n-gram language 
modelling. In order to address the problem of variable word rate, 
several adaptive language modelling approaches have been pro- 
posed with a moderate degree of success. Typically, some notion 
of "topic" is inferred from the text according to the "bag-of-words" 
model. Statistics from different language models (e.g., a general
model and/or models specific to each topic) are then combined
using methods such as mixture modelling Wlu or maximum
entropy [Q]. The dynamic cache model ^ is a related approach,
based on the observation that recently appearing words are
more likely to re-appear than a static n-gram model would
predict. It blends cached unigram statistics for recent words with
the baseline n-grams using an interpolation scheme.



Theoretically, it should not be necessary to rely on an ad hoc
device such as a cache in order to model variable word
occurrences. All the parameters of a language model may be completely
determined according to a probabilistic model of word rate, such as
a Poisson mixture.

In this paper, we outline the theoretical background for modelling
the variable word rate and, using transcripts of spoken data,
illustrate the key observation that word rates are not static. The
constant word rate assumption is then eliminated, and we introduce
a variable word rate n-gram language model. An approach to
estimating relative frequencies using prior information about word
occurrences is presented. It is integrated with standard n-gram
modelling, which naturally involves discounting and smoothing schemes
for practical use. On the DARPA/NIST Hub-4E North American
Broadcast News task, the approach reduces perplexity by up to
10%.

2. MODELLING VARIABLE WORD RATES 

In this section, we illustrate how the assumption of a constant word 
rate fails to capture the statistics of word occurrence in spoken 
(or written) documents. We show that the word rate is variable 
and may be modelled using a Poisson distribution or a continuous 
mixture of Poissons. 

2.1. Poisson Model 

The Poisson distribution is one of the most commonly observed
distributions in both natural and social environments. It is
fundamental to queueing theory: under certain conditions, the
number of occurrences of a certain event during a given period, or in a
specified region of space, follows a Poisson distribution (a Poisson
process ||1C|]).

If word occurrence is assumed to be random according to a Poisson
process, the word rate is no longer uniform. Firstly, we provide a
loose definition of a document as a unit of spoken (or written) data
of a certain length that contains some topic(s), or content(s). We
consider a model in which a word occurs at random in a fixed length
document. For a set of documents we assume that each document
produces this word independently and that the underlying process is
Poisson with a single parameter $\lambda > 0$.

Formally, a Poisson distribution is a discrete distribution (of a
random variable $X$) which is defined for $x = 0, 1, \ldots$ such that

$$\theta^{[p]}(x) = \mathcal{P}(X = x; \lambda) = \frac{e^{-\lambda} \lambda^{x}}{x!} \tag{1}$$

whose expectation and variance are given by $E[X] = V[X] = \lambda$,
respectively Q.
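As a quick numerical check of the Poisson model, the following minimal Python sketch evaluates the probability mass function in log space and confirms that the mean and variance both equal the rate. The function name `poisson_pmf` and the rate value are illustrative choices, not taken from the paper.

```python
import math

def poisson_pmf(x: int, lam: float) -> float:
    """Probability of x occurrences under a Poisson with rate lam > 0."""
    # Work in log space so that x! does not overflow for large x.
    return math.exp(-lam + x * math.log(lam) - math.lgamma(x + 1))

lam = 8.6                # illustrative rate (occurrences per document)
support = range(200)     # the tail beyond 200 is negligible for this rate
mean = sum(x * poisson_pmf(x, lam) for x in support)
var = sum((x - mean) ** 2 * poisson_pmf(x, lam) for x in support)
print(f"mean = {mean:.4f}, variance = {var:.4f}")
```

Both printed values should agree with the rate, which is the property that makes a single Poisson too rigid for words whose counts vary widely between documents.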
Figure 1: The occurrence of words (unigrams) varies between
documents. Histograms show the number of word occurrences for
'FOR', 'YOU', 'OF', and 'CHURCH' from a set of 2583 documents,
each containing at least 100 words (average: 497 words). A
negative binomial distribution (solid line) was used to approximate
each histogram. Word occurrence counts were normalised
to 1000-word length documents.



2.2. Poisson Mixture — Negative Binomial Model 

A less constrained model of variable word rate is offered by a
mixture of Poissons, rather than a single Poisson.

Suppose the parameter $\lambda$ of the pdf (1) is distributed according
to some function $\phi(\lambda)$; then we define a continuous mixture of
Poisson distributions by



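$$\theta(x) = \int_0^{\infty} \theta^{[p]}(x)\, \phi(\lambda)\, d\lambda . \tag{2}$$

In particular, if $\phi(\lambda)$ is a gamma distribution, i.e.,

$$\phi(\lambda) = \mathcal{G}(\lambda; \alpha, \beta) = \frac{\lambda^{\alpha - 1} e^{-\lambda/\beta}}{\Gamma(\alpha)\, \beta^{\alpha}} \tag{3}$$

for $\alpha > 0$ and $\beta > 0$, then the integral (2) is reduced to a discrete
distribution for $x = 0, 1, \ldots$ such that

$$\theta^{[nb]}(x) = \mathcal{NB}(X = x; \alpha, \beta) = \binom{\alpha + x - 1}{x} \frac{\beta^{x}}{(1 + \beta)^{\alpha + x}} . \tag{4}$$

This $\theta^{[nb]}(x)$ is a negative binomial distribution$^1$ and its
expectation and variance are respectively given by $E[X] = \alpha\beta$ and
$V[X] = \alpha\beta(\beta + 1)$.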
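The reduction of the gamma mixture of Poissons to a negative binomial can be checked numerically. The sketch below assumes a gamma density with shape $\alpha$ and scale $\beta$ (the parameterisation implied by $E[X] = \alpha\beta$), integrates the mixture with a simple midpoint rule, and compares the result with the closed form. All function names, the integration limits and the parameter values are illustrative assumptions.

```python
import math

def poisson_pmf(x, lam):
    return math.exp(-lam + x * math.log(lam) - math.lgamma(x + 1))

def gamma_pdf(lam, alpha, beta):
    """Gamma density with shape alpha and scale beta (assumed parameterisation)."""
    return math.exp((alpha - 1) * math.log(lam) - lam / beta
                    - math.lgamma(alpha) - alpha * math.log(beta))

def negbin_pmf(x, alpha, beta):
    """Closed form: C(alpha + x - 1, x) * beta^x / (1 + beta)^(alpha + x)."""
    log_coeff = math.lgamma(alpha + x) - math.lgamma(x + 1) - math.lgamma(alpha)
    return math.exp(log_coeff + x * math.log(beta)
                    - (alpha + x) * math.log(1.0 + beta))

def mixture_pmf(x, alpha, beta, upper=100.0, steps=20_000):
    """Continuous Poisson mixture, integrated over (0, upper) by the midpoint rule."""
    h = upper / steps
    total = 0.0
    for i in range(steps):
        lam = (i + 0.5) * h
        total += poisson_pmf(x, lam) * gamma_pdf(lam, alpha, beta)
    return h * total

alpha, beta = 2.0, 3.0   # illustrative parameters
for x in (0, 1, 3):
    print(x, mixture_pmf(x, alpha, beta), negbin_pmf(x, alpha, beta))

support = range(600)     # the tail beyond 600 is negligible for these parameters
nb_mean = sum(x * negbin_pmf(x, alpha, beta) for x in support)
nb_var = sum((x - nb_mean) ** 2 * negbin_pmf(x, alpha, beta) for x in support)
print(nb_mean, nb_var)   # compare with alpha*beta and alpha*beta*(beta + 1)
```

Unlike the single Poisson, the variance here exceeds the mean by the factor $(\beta + 1)$, which is what lets the model absorb document-to-document variation.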

$^1$ Let $\phi(\lambda)$ be $\mathcal{G}(\lambda; \alpha, \beta)$ in (2). This integration is straightforward
Figure 2: Variable bigram occurrence rates. Histograms show the
number of bigram occurrences (in normalised 1000-word documents)
for 'FOR YOU' and 'OF CHURCH', combinations of unigrams
used in figure 1. They are fitted by negative binomial
distributions (solid lines).



2.3. Word Occurrences in Documents 

The histograms in figure 1 show the number of word (unigram)
occurrences in spoken news broadcasts, taken from transcripts of
the Hub-4E Broadcast News acoustic training data (1996-97).
These transcripts were separated into documents according to
section markers and those with fewer than 100 words were removed,
resulting in 2583 documents containing slightly less than 1.3 million
words in total. In the following, word occurrence counts
were normalised to 1000-word length documents.

'FOR' and 'YOU' appeared approximately the same number
of times across all the transcripts. Under a constant word rate
assumption, each would have been assigned a probability of around
0.0086. However, their occurrence rates varied from document to
document: about 11% and 33% of all documents did not contain
'FOR' and 'YOU' (respectively), while 1% and 3% contained these
words more than 30 times. This seems to indicate that occurrences
of 'FOR' are less dependent on the content of the document. A
negative binomial distribution was used to model the variable word
rate in each case (the solid line in figure 1).
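To see how far these observations are from a single Poisson, the sketch below computes what a Poisson with rate equal to the constant-rate probability (0.0086) times the document length would predict. Using both the average length (497 words) and the normalised length (1000 words) is an assumption made here for illustration; the function name is hypothetical.

```python
import math

def poisson_pmf(x: int, lam: float) -> float:
    return math.exp(-lam + x * math.log(lam) - math.lgamma(x + 1))

CONSTANT_RATE = 0.0086   # constant-rate probability quoted for 'FOR' and 'YOU'

def single_poisson_predictions(doc_length: int):
    """Return (P(X = 0), P(X > 30)) for a Poisson with rate CONSTANT_RATE * doc_length."""
    lam = CONSTANT_RATE * doc_length
    p_zero = poisson_pmf(0, lam)
    p_over_30 = sum(poisson_pmf(x, lam) for x in range(31, 300))
    return p_zero, p_over_30

for doc_length in (497, 1000):
    p_zero, p_over_30 = single_poisson_predictions(doc_length)
    print(f"length {doc_length}: P(X = 0) = {p_zero:.2e}, P(X > 30) = {p_over_30:.2e}")
```

Under either length, the predicted fraction of documents with no occurrences falls well below the observed 11% and 33%, and the predicted fraction with more than 30 occurrences falls well below the observed 1% and 3%. This gap motivates the heavier-tailed negative binomial.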

The negative binomial seems to model word occurrence rate
relatively well for most vocabulary items, regardless of frequency.
Figure 1 illustrates this for one of the most frequent words, 'OF'
(probability of 0.023 according to the constant word rate
assumption), and the less frequently occurring 'CHURCH' (less than
0.00029). In particular, 'CHURCH' appeared in only 93 out of
2583 documents, but 28 of them contained more than 10 instances,
suggesting strong correlation with document content.
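The text does not state how the negative binomial curves were fitted. One simple option, shown here purely as an assumption, is moment matching: with $E[X] = \alpha\beta$ and $V[X] = \alpha\beta(\beta + 1)$, the sample mean $m$ and variance $v$ give $\beta = v/m - 1$ and $\alpha = m/\beta$. The helper names and the example counts are hypothetical.

```python
from statistics import fmean, pvariance

def normalise_count(count: int, doc_length: int, target_length: int = 1000) -> float:
    """Scale a raw count to the count expected in a target_length-word document."""
    return count * target_length / doc_length

def fit_negbin_moments(counts):
    """Moment-matching estimate of (alpha, beta); needs variance > mean."""
    m = fmean(counts)
    v = pvariance(counts, mu=m)
    if v <= m:
        raise ValueError("no overdispersion: sample variance must exceed the mean")
    beta = v / m - 1.0
    alpha = m / beta
    return alpha, beta

# Hypothetical per-document counts of a bursty word (most documents contain none).
counts = [0, 0, 0, 0, 0, 0, 2, 5, 12, 1]
alpha, beta = fit_negbin_moments(counts)
print(alpha, beta)
```

A small $\alpha$ with a large $\beta$ corresponds to a word that is absent from most documents but frequent where it does appear, the pattern described for 'CHURCH'.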

We also collected statistics of bigrams appearing in the Broadcast
News transcripts. Figure 2 shows histograms and their negative
binomial fits for the bigrams 'FOR YOU' and 'OF CHURCH'.
Although the data are very sparse (e.g., they appeared in 127 and 6
documents, respectively), this suggests that variable bigram rate can
also be modelled using a continuous mixture of Poissons.



using the definition of the gamma function, $\Gamma(\alpha) = \int_0^{\infty} t^{\alpha - 1} e^{-t}\, dt$,
and the recursion, $\Gamma(\alpha + 1) = \alpha\Gamma(\alpha)$. The resultant pdf (4) has a slightly
unconventional form in comparison to that in most standard textbooks
(e.g., |jn]]), but is identical by setting a new parameter $\gamma = \frac{1}{1 + \beta}$ with
$0 < \gamma < 1$.



3. VARIABLE WORD RATE LANGUAGE MODELS 

Taking word occurrence rate into account changes a probabilistic
language model from a situation akin to playing a lottery, to
something closer to betting on a horse race: the odds for a certain word
improve if it has come up in the past. In this section, we eliminate
the constant word rate assumption and present a variable word rate
n-gram language model.

3.1. Relative Frequencies with Prior Word Occurrences 

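Let $f(w > n_w)$ denote a relative frequency after we observe $n_w$
occurrences of word $w$. It is calculated by

$$f(w > n_w) = \frac{E[\theta_w] - \sum_{j=0}^{n_w - 1} j\, \theta_w(j)}{N \left\{ 1 - \sum_{j=0}^{n_w - 1} \theta_w(j) \right\}} . \tag{5}$$

The function is defined for $n_w = 0, 1, \ldots, N$, where $N$ is a fixed
document length (e.g., $N$ is normalised to 1000 in figures 1 and 2).
$\theta_w(j)$ is the occurrence rate for word $w$ in an $N$-length document
(e.g., Poisson, negative binomial), satisfying

$$\sum_{j=0}^{N} \theta_w(j) = 1 \quad \text{and} \quad E[\theta_w] = \sum_{j=0}^{N} j\, \theta_w(j) .$$

In particular,

$$f(w > 0) = \frac{E[\theta_w]}{N} \tag{6}$$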



which corresponds to the case with no prior information of word
occurrence. In the conventional approach with the constant word
rate assumption, this $f(w > 0)$ is not modified regardless of any
word occurrences. Further, function (5) satisfies our intuition: the
value of $f(w > n_w)$ increases monotonically as the number of
observations $n_w$ accumulates (easy to verify), and it reaches unity
(1) when $n_w = N$.
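The stated properties can be checked numerically. The sketch below reads equation (5) as the conditional mean of the occurrence count, given at least $n_w$ occurrences, divided by $N$; this reading is an assumption, chosen because it reproduces the zero-prior case (6), the monotonic increase, and the value of one at $n_w = N$. The occurrence rate is a Poisson truncated to $0, \ldots, N$ and renormalised; the function names and the small values of $N$ and the rate are illustrative.

```python
import math

def truncated_poisson(lam: float, N: int) -> list[float]:
    """Occurrence rates theta(j), j = 0..N: Poisson weights renormalised to sum to one."""
    weights = [math.exp(-lam + j * math.log(lam) - math.lgamma(j + 1))
               for j in range(N + 1)]
    total = sum(weights)
    return [wt / total for wt in weights]

def relative_frequency(theta: list[float], n_w: int) -> float:
    """f(w > n_w): mean count given at least n_w occurrences, divided by N."""
    N = len(theta) - 1
    # Summing the tail directly avoids subtracting two nearly equal numbers.
    tail_mass = sum(theta[n_w:])
    tail_mean = sum(j * theta[j] for j in range(n_w, N + 1))
    return tail_mean / (N * tail_mass)

N = 50                                   # small illustrative document length
theta = truncated_poisson(2.0, N)        # illustrative rate of 2 per document
values = [relative_frequency(theta, n) for n in range(N + 1)]
print(values[0], values[1], values[10], values[N])
```

For realistic document lengths such as $N = 1000$ the far tail of the distribution underflows in double precision, so a practical implementation would need to work with logarithms or cap $n_w$.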

The characteristics of function (5) are illustrated in figure 3.
The right-hand panel shows relative frequencies for 'OF' and
'CHURCH' after a certain number of previous observations of the
word. It indicates that the first few instances of the frequent word
('OF') do not modify its relative frequency very much, but have
a substantial effect on the relative frequency of the less common
word ('CHURCH'). As the number of observations increases, the
relative frequency of 'CHURCH' catches up with that of 'OF'.

Finally, in order to convert this relative frequency model to any
type of probabilistic model for language, normalisation is required.
This is achieved by dividing $f(w > n_w)$ by $\sum_{w' \in V} f(w' > n_{w'})$,
where $V$ denotes the vocabulary. Variable relative frequencies
for bigrams can also be calculated in a similar fashion.
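The normalisation step can be sketched in a few lines of Python. The three-word vocabulary and the relative frequency values are hypothetical; the point is only that raising the relative frequency of one word after it has been observed lowers the normalised probability of every other word.

```python
def normalise(freqs: dict[str, float]) -> dict[str, float]:
    """Divide each relative frequency by the sum over the vocabulary."""
    total = sum(freqs.values())
    return {word: value / total for word, value in freqs.items()}

# Hypothetical relative frequencies before and after 'CHURCH' has been observed
# several times; 'OTHER' stands in for the rest of the vocabulary.
before = normalise({"OF": 0.023, "CHURCH": 0.00029, "OTHER": 0.97671})
after = normalise({"OF": 0.023, "CHURCH": 0.012, "OTHER": 0.97671})
print(before["CHURCH"], after["CHURCH"])
print(before["OF"], after["OF"])
```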

3.2. Discounting and Smoothing Techniques 

Figure 3: The left panel shows word occurrence rates for 'OF' and
'CHURCH' in documents of (normalised to) 1000-word length,
modelled by negative binomial distributions (identical to those in
figure 1). The right panel shows relative frequencies after
a certain number of word occurrences. Circles ('o') correspond
to relative frequencies under the constant word rate assumption
(0.023 for 'OF' and 0.00029 for 'CHURCH').

For any practical application, smoothing of the probability
estimates is essential to avoid zero probabilities for events that were
not observed in the training data. Let $\mathcal{E}(w|v)$ denote a bigram
entry (a word $v$ followed by $w$) in the model. Further, $f(w|v > n_{w|v})$
denotes a relative frequency after we observe $n_{w|v}$ occurrences
of the bigram. A bigram probability $p(w|v > n_{w|v})$ may
be smoothed with a unigram probability $p(w > n_w)$. Using the
interpolation method mM:

$$p(w|v > n_{w|v}) = \hat{f}(w|v > n_{w|v}) + \{1 - \alpha(v)\} \cdot p(w > n_w) \tag{7}$$

where $\hat{f}(w|v > n_{w|v})$ denotes a "discounted" relative frequency
(described later) and

$$\alpha(v) = \sum_{w \in \mathcal{E}(w|v)} \hat{f}(w|v > n_{w|v}) \tag{8}$$

is a non-zero probability estimate (i.e., the probability that a
bigram entry $\mathcal{E}(w|v)$ exists in the model). Alternatively, the back-off
smoothing Jl 3f] may be applied:

$$p(w|v > n_{w|v}) = \begin{cases} \hat{f}(w|v > n_{w|v}) & \text{if } \mathcal{E}(w|v) \text{ exists,} \\ \beta(v) \cdot p(w > n_w) & \text{otherwise.} \end{cases} \tag{9}$$

In (9), $\beta(v)$ is a back-off factor and is calculated by

$$\beta(v) = \frac{1 - \alpha(v)}{\sum_{w \notin \mathcal{E}(w|v)} f(w > n_w)} . \tag{10}$$



A unigram probability $p(w > n_w)$ can be obtained similarly by
smoothing with some constant value.
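The two smoothing schemes can be sketched as follows for a single context word $v$. The inputs are hypothetical: `f_hat_v` maps each word with an existing bigram entry to its discounted relative frequency, and `p_uni` holds smoothed unigram probabilities that sum to one. In the back-off factor the sketch sums unigram probabilities over words without a bigram entry, an assumption made here so that the resulting distribution is normalised.

```python
def interpolate(f_hat_v: dict[str, float], p_uni: dict[str, float]) -> dict[str, float]:
    """Interpolation: discounted bigram frequency plus (1 - alpha) times the unigram."""
    alpha = sum(f_hat_v.values())          # probability mass kept by existing entries
    return {w: f_hat_v.get(w, 0.0) + (1.0 - alpha) * p for w, p in p_uni.items()}

def back_off(f_hat_v: dict[str, float], p_uni: dict[str, float]) -> dict[str, float]:
    """Back-off: use the bigram entry if it exists, else a scaled unigram."""
    alpha = sum(f_hat_v.values())
    unseen_mass = sum(p for w, p in p_uni.items() if w not in f_hat_v)
    beta = (1.0 - alpha) / unseen_mass     # back-off factor
    return {w: f_hat_v[w] if w in f_hat_v else beta * p for w, p in p_uni.items()}

# Hypothetical values for one context word v.
f_hat_v = {"CHURCH": 0.3, "THE": 0.4}
p_uni = {"CHURCH": 0.1, "THE": 0.5, "OF": 0.3, "YOU": 0.1}
print(interpolate(f_hat_v, p_uni))
print(back_off(f_hat_v, p_uni))
```

Interpolation always mixes in the unigram, even for bigrams that exist in the model, whereas back-off consults the unigram only when the bigram entry is missing.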

Finally, a number of standard discounting methods exist for
constant word rate models (see, e.g., jl3[ |lj]). Analogous
discounting functions for variable word rate models may be

$$f_{abs}(w|v > n_{w|v}) = f(w|v > n_{w|v}) - \frac{c}{N} \tag{11}$$

for the absolute discounting, and

$$f_{gt}(w|v > n_{w|v}) = d \cdot f(w|v > n_{w|v}) \tag{12}$$

for the Good-Turing discounting. Discounting factors ($c$ and
$d$) may be obtained using the zero prior information case, i.e.,
the $f(w|v > 0)$ values of all bigrams in the model; for the remaining
details, see, e.g., [Oh or H.
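A small numerical illustration of the two discounting functions follows. The discount values $c$ and $d$ and the bigram relative frequencies are hypothetical; how the factors are estimated is left to the cited references. The sketch shows that absolute discounting removes the same amount, $c/N$, from every bigram entry, and that the removed mass is what becomes available for smoothing.

```python
def absolute_discount(f: float, c: float, N: int) -> float:
    """Absolute discounting: subtract the constant c / N."""
    return f - c / N

def good_turing_discount(f: float, d: float) -> float:
    """Good-Turing style discounting: scale by the factor d."""
    return d * f

N = 1000   # normalised document length
c = 0.5    # hypothetical absolute discount
d = 0.9    # hypothetical Good-Turing factor

# Hypothetical relative frequencies of three bigrams sharing one context word.
bigram_f = {"CHURCH": 0.012, "THE": 0.030, "YOU": 0.004}
discounted = {w: absolute_discount(f, c, N) for w, f in bigram_f.items()}
freed_mass = sum(bigram_f.values()) - sum(discounted.values())
print(discounted, freed_mass)
print({w: good_turing_discount(f, d) for w, f in bigram_f.items()})
```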
Figure 4: Unigram and bigram perplexities for the reference (key)
transcription of 1997 Hub-4E evaluation data. Conventional
models (constant word rate) are compared with models using Poisson
estimates of variable word rate. Document length for the latter was
normalised to between 200 and 50 000.



3.3. Language Model Perplexities 

As noted in section 2.3, we extracted 2583 documents from the
transcripts of the Broadcast News acoustic training data, each with a
minimum of 100 words. A vocabulary of 19 885 words was
selected and 390 000 bigrams were counted. In these experiments,
the absolute discounting scheme (11) was applied, followed by
interpolation smoothing (7). Figure 4 shows perplexities for the
reference (key) transcription of the 1997 Hub-4E evaluation data,
containing three hours of speech and approximately 32 000 words.
Using conventional modelling with a constant word rate
assumption, unigram and bigram perplexities were 936.5 and 237.9,
respectively.

For the variable word rate models, the Poisson distribution was
adopted because of its simplicity of calculation. Word occurrence
counts were normalised to $N$-word length documents, with $N$
between 200 and 50 000, and the model parameters were
modified 'on-line' during the perplexity calculation. For each
occurrence of a word (bigram) in the evaluation data, a histogram
of the past $N$ words (bigrams) was collected and their relative
frequencies were modified according to the Poisson estimates (with
appropriate normalisation applied), then discounted and smoothed.
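The on-line procedure can be sketched for unigrams as follows. This is a simplified illustration under several assumptions that the text does not specify: the Poisson rate for each word is taken as $N$ times its constant-rate probability, the occurrence rate is truncated to $0, \ldots, N$, the relative frequency is the conditional-mean reading of equation (5), a partially filled window is used as is at the start of the data, and discounting and smoothing are omitted. All names and the toy vocabulary are hypothetical.

```python
import math
from collections import Counter, deque

def truncated_poisson(lam, N):
    weights = [math.exp(-lam + j * math.log(lam) - math.lgamma(j + 1))
               for j in range(N + 1)]
    total = sum(weights)
    return [wt / total for wt in weights]

def relative_frequency(theta, n_w):
    N = len(theta) - 1
    tail_mass = sum(theta[n_w:])
    tail_mean = sum(j * theta[j] for j in range(n_w, N + 1))
    return tail_mean / (N * tail_mass)

def online_perplexity(tokens, base_prob, N):
    """Unigram perplexity with relative frequencies updated from the past N words."""
    thetas = {w: truncated_poisson(N * p, N) for w, p in base_prob.items()}
    window = deque()      # the most recent N tokens
    counts = Counter()    # histogram of the tokens currently in the window
    log_sum = 0.0
    for token in tokens:
        freqs = {w: relative_frequency(thetas[w], counts[w]) for w in base_prob}
        log_sum += math.log(freqs[token] / sum(freqs.values()))
        window.append(token)
        counts[token] += 1
        if len(window) > N:
            counts[window.popleft()] -= 1
    return math.exp(-log_sum / len(tokens))

base_prob = {"a": 0.5, "b": 0.3, "c": 0.2}   # toy constant-rate unigram model
print(online_perplexity(["c", "a", "c", "c", "b", "c"], base_prob, N=20))
```

Recomputing every word's relative frequency at each position is quadratic in practice for a large vocabulary; a realistic implementation would update only the words whose window counts changed.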

As figure 4 indicates, the variable word rate models were able
to reduce perplexities relative to the constant word rate models. A
unigram perplexity of 843.4 (10% reduction) was achieved when
N = 500, and a bigram perplexity of 219.0 (8% reduction) when
N = 50 000. The difference was predictable because bigrams
were orders of magnitude more sparse than unigrams.
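The quoted reductions follow directly from the reported perplexities, as this short Python check shows. The `perplexity` helper uses the standard definition (the exponential of the negative mean log probability); both function names are illustrative.

```python
import math

def perplexity(probs: list[float]) -> float:
    """Exponential of the negative mean log probability of the test tokens."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))

def relative_reduction(baseline: float, improved: float) -> float:
    return (baseline - improved) / baseline

unigram_reduction = relative_reduction(936.5, 843.4)
bigram_reduction = relative_reduction(237.9, 219.0)
print(f"unigram: {unigram_reduction:.1%}, bigram: {bigram_reduction:.1%}")
```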

4. CONCLUSION 

In this paper, we have presented a variable word/n-gram rate
language model, based upon an approach to estimating relative
frequencies using prior information about word occurrences. Poisson
and negative binomial models were used to approximate word
occurrences in documents of fixed length. On the Broadcast News
task, the approach reduced perplexity by up to 10%, indicating
its potential, although the technique is still immature. Because of
the data sparsity problem, it is not clear if the approach can be
applied to the language model components of current
state-of-the-art speech recognition systems, which typically use
3/4-grams. However, we believe this technique does have application
to problems in the area of information extraction. In particular, we
are planning to apply these methods to the named entity annotation
task, along with further theoretical development.